R Configuration

Our R Configuration is as follows:

sessionInfo(package=NULL)
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.4
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2    
##  [5] htmltools_0.3.5 yaml_2.1.14     Rcpp_0.12.10    stringi_1.1.3  
##  [9] rmarkdown_1.4   knitr_1.15.1    stringr_1.2.0   digest_0.6.12  
## [13] evaluate_0.10

Data

Our original data is sourced from Data.World. It includes visits to National Parks from 1904 to 2016. We joined this data with population and gender breakdowns by state from U.S. Census data. Finally, we joined latitude and longitude points for each of the 50 U.S. states, as well as Washington, D.C.

Initial Analysis

Our initial analysis was performed in Tableau.

We began by creating a boxplot that showed how National Park visits differed by operating region. The image below shows how the National Park Service divides its parks into regions:

The boxplot shows:

Boxplot: Regions vs. Visits

Scatterplot: Visits vs. Years

With that information in mind, we created a scatterplot of visits over the years, colored by region. These trends were then used to create our first Interesting Visualization, which shows visit growth over the years.

Barchart: Region & State by Visits

Additionally, we broke visits down by both state and region, and created a window average calculation to find the average number of visits across regions and to see how each state compared.
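The window average above was built in Tableau, but the same idea can be sketched in R. The example below is only an illustration: the data frame, its column names, and the numbers are hypothetical, and it assumes the average is partitioned by region. It uses base R's `ave()` so that it runs without extra packages.

```r
# Hypothetical toy data: total visits per state, tagged with a region.
visits <- data.frame(
  Region = c("Intermountain", "Intermountain", "Pacific West", "Pacific West"),
  State  = c("WY", "UT", "CA", "WA"),
  Visits = c(100, 300, 500, 700),
  stringsAsFactors = FALSE
)

# Window average: the mean of Visits within each region, repeated on
# every row of that region (similar in spirit to a Tableau window average).
visits$RegionAvg <- ave(visits$Visits, visits$Region, FUN = mean)

# How each state compares with its region's average.
visits$DiffFromAvg <- visits$Visits - visits$RegionAvg

print(visits)
```

Keeping the average on every row, rather than collapsing to one row per region, is what makes the state-by-state comparison a single subtraction.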

Barchart: Park Name vs. Visits

Next, we tried to identify which parks had the highest numbers of visits over this time period. We created the following visualization, then placed all parks with more than 100 million visits into a new set. This set was then used to create our second Interesting Visualization.
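For readers working outside Tableau, a rough R equivalent of building that set might look like the sketch below. The park names, years, and visit counts are made up for illustration, and it assumes the 100 million threshold applies to each park's total visits over the whole period.

```r
# Hypothetical toy data: yearly visits per park (numbers are illustrative only).
parks <- data.frame(
  ParkName = c("Park A", "Park A", "Park B", "Park C", "Park C"),
  Year     = c(2015, 2016, 2016, 2015, 2016),
  Visits   = c(60e6, 70e6, 40e6, 90e6, 5e6),
  stringsAsFactors = FALSE
)

# Total visits per park across all years.
totals <- aggregate(Visits ~ ParkName, data = parks, FUN = sum)

# Keep only parks whose total exceeds 100 million, largest first.
top_parks <- totals[totals$Visits > 100e6, ]
top_parks <- top_parks[order(-top_parks$Visits), ]

print(top_parks)
```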

Join: State vs. Visits, Colored by Male-Female Ratio

We joined data from the U.S. Census that included the number of males, females, and total population for each state. Using this, we created a calculated field that gives the ratio of males to females in each state. The following visualization attempts to find a trend in visits by gender breakdown, though none is immediately obvious:
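The join and the calculated field can be approximated in R as follows. This is a minimal sketch under stated assumptions: the column names `Male`, `Female`, and `Visits` and all values are hypothetical, and the join key is assumed to be `State`.

```r
# Hypothetical census extract: counts of males and females per state.
census <- data.frame(
  State  = c("Alaska", "Texas"),
  Male   = c(520, 4900),
  Female = c(480, 5000),
  stringsAsFactors = FALSE
)

# Hypothetical visit totals per state.
visits <- data.frame(
  State  = c("Texas", "Alaska"),
  Visits = c(1200, 800),
  stringsAsFactors = FALSE
)

# Calculated field: ratio of males to females.
census$MaleFemaleRatio <- census$Male / census$Female

# Inner join on State; merge() matches rows by the shared key column.
joined <- merge(visits, census, by = "State")

print(joined[, c("State", "Visits", "MaleFemaleRatio")])
```

A ratio above 1 means more males than females in that state, and the ratio column can then be mapped to color in a plot.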

Data.World Pull and ETL Operations

With our Tableau analysis complete, we pulled the data into RStudio.

Data.World Pulls

First, our initial National Parks Visits data is pulled:

source("../01 Data/prETLNatVisPull.R")
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: data.world
## 
## Attaching package: 'data.world'
## The following object is masked from 'package:dplyr':
## 
##     query
## Joining, by = "State"
## [1] "Success!"

Next, the Census Data is pulled:

source("../01 Data/censusPull.R")
## [1] "Success!"

Finally, the Latitude and Longitude information is pulled:

source("../01 Data/stateLatLongPull.R")
## Warning: Duplicated column names deduplicated: 'Location' =>
## 'Location_1' [6]
## [1] "Success!"

ETL

Our final task is to perform ETL operations. This includes the removal of blank data, as well as character formatting to better meet our needs for further visualization in R.

source("../01 Data/natVisETL.R")
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated

## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## [1] "Success!"

The ETL script is available to view in “01 Data,” or as an image here.
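The repeated "invalid factor level, NA generated" warnings above are what R emits when an empty string is assigned into a factor column that has no `""` level, so the missing values stay `NA`. The sketch below is not the contents of natVisETL.R; it is a hypothetical example, with made-up column names and values, of one way to blank out missing text and drop empty rows without triggering that warning.

```r
# Hypothetical toy data with a factor column containing a missing value.
df <- data.frame(
  ParkName = factor(c("Yellowstone", NA, "Zion")),
  Visits   = c(10, 20, NA)
)

# Convert factor columns to character first; assigning "" into a factor
# without a "" level would warn and leave NA in place.
is_factor <- vapply(df, is.factor, logical(1))
df[is_factor] <- lapply(df[is_factor], as.character)

# Replace missing text with an empty string.
is_char <- vapply(df, is.character, logical(1))
df[is_char] <- lapply(df[is_char], function(x) {
  x[is.na(x)] <- ""
  x
})

# Drop rows that have no visit count.
df <- df[!is.na(df$Visits), ]

print(df)
```

Reading the CSV with `stringsAsFactors = FALSE` would avoid the conversion step altogether, at the cost of handling categorical columns explicitly later.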

Interesting Visualizations

1: Massive Success of “Mission 66”

Two points stand out here. First, National Park visits actually grew during the Great Depression and the Great Recession, which suggests that visits to these parks are not treated as discretionary spending. It could also be that a poor economy has consumers looking for cheaper vacations, and National Parks fill that role. Lastly, stress during difficult economic times may spur people to spend time outdoors and enjoy nature.

However, the most interesting aspect is the massive growth in National Park visits during “Mission 66.” Mission 66 involved a large amount of infrastructure construction to make parks more accessible. In addition to roads and trails, it funded the creation of camping and housing sites, as well as an advertising campaign that promoted the natural beauty of these parks to citizens across the country.

3: Visits to Population Ratio

This visualization shows visits by state, filtered by year, and colored by the ratio of visits to state population. The goal here is to identify states or areas that bring in vastly more visitors than their populations. Washington, D.C., a city with a population of approximately 670,000 people, draws almost 55 times its population in visits to its National Parks and Monuments. Wyoming shows the same pattern, drawing nearly 17 times its population in visits.
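The ratio behind this visualization is a simple per-row division after filtering to one year. The sketch below illustrates the calculation with placeholder states and made-up numbers; the column names are assumptions, not the ones used in our data.

```r
# Hypothetical toy data: visits and population per state and year.
visits <- data.frame(
  State      = c("A", "A", "B", "B"),
  Year       = c(2015, 2016, 2015, 2016),
  Visits     = c(400, 500, 30, 60),
  Population = c(10, 10, 20, 20),
  stringsAsFactors = FALSE
)

# Filter to a single year, as the visualization's year filter does.
one_year <- visits[visits$Year == 2016, ]

# Ratio of visits to population; values far above 1 flag places that
# draw many more visits than they have residents.
one_year$VisitsPerCapita <- one_year$Visits / one_year$Population

# Sort so the highest ratios come first.
one_year <- one_year[order(-one_year$VisitsPerCapita), ]

print(one_year)
```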

Shiny Deployment

We took the CSV files produced by the ETL and pull operations above and uploaded them to Data.World. From there, our Shiny app uses SQL to query the relevant data and create a variety of visualizations in R. Our Shiny deployment displays the steps taken to reach the interesting visualizations.